Method for predicting the readings of Japanese ideographs

ABSTRACT

System and methods allowing for effective and reliable reading predictions for Japanese ideographs are provided. In an illustrative implementation, a reading predictions system operating in “learning” and “execution/run-time” modes is provided. In the “learning” mode, the reading predictions system operates on a number of input sources to produce a decision tree that is used in the “execution/run-time” mode to return reading predictions for inputted Japanese sentences containing Japanese ideographs. Among the inputs utilized in the “learning” mode are base Japanese script readings, a training corpus, and quasi-phonological rules. From these inputs, underlying readings and a decision tree are created. When operating in the “execution/run-time” mode, the reading predictions system employs a morphological analyzer to perform a morphological analysis on inputted sentences. Using the morphological analysis, the quasi-phonological rules, the underlying readings, and the decision tree, reading predictions are provided.

PRIORITY

[0001] This application is related to and claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application Serial No. 60/219,981, filed Jul. 21, 2000, entitled “METHOD FOR PREDICTING THE READINGS OF JAPANESE IDEOGRAPHS,” the contents of which are hereby incorporated by reference in their entirety.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention relates to the field of predicting readings of foreign languages, and more particularly, to the reliable and effective reading predictions of Japanese ideographs.

[0004] 2. Brief Description of Prior Developments

[0005] The Japanese language is written using a combination of four scripts: hiragana, katakana, romaji, and kanji. Hiragana and katakana are syllabaries, that is, phonetic scripts in which each character represents a syllable of a word. Generally, hiragana and katakana are collectively referred to as kana. Katakana are usually reserved for writing words that have been borrowed from foreign languages (except Chinese) within the last 400 years; they also may be used to provide emphasis or for graphic effect. Romaji are an alphabet: the familiar Roman alphabet used in North America, Western Europe and elsewhere. In the past, romaji have been used to transcribe loan words, for emphasis, and to transcribe Japanese for foreign armies of occupation. Kanji are ideographs, that is, characters that represent specific words or parts of words, rather than specific sounds. It is not the case that kanji are only related to free floating ideas, however. The link between kanji and words is fixed, for the most part. That is, for most words, a writer cannot choose between different kanji. For example, even though all Japanese speakers would agree that both the characters

and

essentially mean “dog”, it would be incomprehensible to write the word

(chuuken) “faithful dog” using the character

. Likewise, the link between words and their pronunciation is fixed. That is, dialectal variation aside, there is usually only one way to pronounce a word. Thus, there is a firm link between kanji and pronunciation, but it is not a direct one: it is always mediated through the particular word that is being written.

[0006] Writers can, however, choose whether or not to use kanji at all. It would not be incorrect to write chuuken using hiragana (

), katakana (

), romaji (chuuken), or a mixture (

,

). It is very common to write words (especially verbs) in a combination of kanji and hiragana. However, any other mixture of scripts within the same word is unusual enough to be considered an error. Because a word that contains kanji can also be written in a phonetic script, it is possible to talk about the phonetic value of the kanji in that word. This is what is meant by the reading of a kanji in a particular word: its pronunciation when the word is read aloud, or its spelling in a phonetic script when the word is written phonetically. For example, the reading of

in

is ken. However, because of the particular history of Japanese, most kanji have at least two entirely distinct readings. For example, the reading of

in the word

(inuoyogi) is inu;

is read as nin in

(ningen), jin in

(nihonjin), and hito in

(hitobito). Furthermore, many kanji have different readings that are systematically related to each other. For example,

is read as hatsu in

(kaihatsu), ha? in

(happyou), and patsu in

(kappatsu).

[0007] A final source of complexity when determining the underlying reading of Japanese written language (e.g. Japanese script) is that there is some variation in how much of a word is represented in kanji. For example, the word kakitsuke is sometimes written as

, but at other times as

. The reading of the kanji

is ka in the first variant, kaki in the second. Both of these variants are considered acceptable, but to mix the two variants in a single document is considered an error.

[0008] Given all of the above-mentioned sources of variation, predicting the correct reading of a kanji in a given word is not a simple task. Educated native speakers of Japanese can usually remember or guess the correct readings of kanji, but software is less successful at performing this task.

[0009] Current practices in automating the reading of Japanese script are inefficient and can be unreliable. For example, a brute force solution to the problem is to create a dictionary of words and link the entry for the phonetic spelling of a word to the entries for all its other dictionary spellings. This type of solution, however, faces several problems. Since Japanese is traditionally written without inserting space between words, it is far from trivial to look words up in a dictionary. It would be necessary to first identify the boundaries between the words, requiring a considerable level of linguistic knowledge and an expenditure of significant resources. Because Japanese is a more highly inflected language than English, it is quite common for word forms to be extensively modified by affixation and compounding; a dictionary that contained every possible form of a word would be astonishingly large and unwieldy. As such, no dictionary could be sufficiently large to adequately predict readings of Japanese script. Further, since new words are always being coined or borrowed, such a dictionary would have to be adaptable and updateable.

[0010] From the foregoing it is appreciated that there exists a need for systems and methods that efficiently and reliably predict the reading of Japanese script. By having these systems and methods, the drawbacks of existing practices are overcome.

SUMMARY OF THE INVENTION

[0011] A system and methods to efficiently predict readings of Japanese script are provided. In an illustrative implementation, the present invention comprises a reading predictions system operating in two modes, “learning” and “execution/run-time” modes. In the “learning” mode, a reading analyzer accepts as input base Japanese script (i.e. kanji) readings, a training corpus (e.g. a lexicon of Japanese words and their readings) and quasi-phonological rules to produce an analyzed corpus and underlying readings for each entry in the training corpus. A corpus classifier is then invoked to produce a decision tree. In the described implementation, the corpus classifier employs a learning algorithm to create the decision tree.

[0012] When operating in the “execution/run-time” mode, a reading predictor accepts as input the created decision tree, the generated underlying readings and the quasi-phonological rules. In addition, the reading predictor accepts as input a morphological analysis of inputted Japanese sentences having Japanese ideographs. The morphological analysis is created by a morphological analyzer which, among other things, operates to parse inputted Japanese sentences. Using these inputs, the reading predictor produces reading predictions for the inputted Japanese sentences.

[0013] In the implementation described, the reading predictions system is incorporated in an exemplary computing application providing style checking for inputted Japanese text.

DETAILED DESCRIPTION OF THE DRAWINGS

[0014] The methods and system for predicting the readings of Japanese ideographs are further described with reference to the accompanying drawings in which:

[0015] FIG. 1 is a block diagram of an exemplary computing environment in which aspects of the present invention may be incorporated;

[0016] FIG. 2 is a block diagram of components cooperating to execute the learning feature related to the effective prediction of readings of Japanese script in accordance with the present invention;

[0017] FIG. 2A is a block diagram of components cooperating to realize the execution of the prediction of readings of Japanese script in accordance with the present invention;

[0018] FIG. 3 is a block diagram of exemplary processing for Japanese script in accordance with the present invention;

[0019] FIG. 4 is a flow diagram of the processing performed to develop a decision tree for use when predicting the reading of Japanese script in accordance with the present invention;

[0020] FIG. 4A is a flow diagram of the processing performed when predicting the reading of Japanese script in accordance with the present invention; and

[0021] FIG. 5 is a screen shot of an exemplary computing application having Japanese reading features in accordance with the present invention.

DETAILED DESCRIPTION OF ILLUSTRATIVE IMPLEMENTATIONS

[0022] Overview

[0023] The Japanese language is spoken by the approximately 120 million inhabitants of Japan, and by the Japanese living in Hawaii and on the North and South American mainlands. It is also spoken as a second language by the Chinese and the Korean people who lived under Japanese occupation earlier this century.

[0024] Generally, three categories of words exist in Japanese. The native Japanese words constitute the largest category, followed by words originally borrowed from China in earlier history, and the smallest but a rapidly growing category of words borrowed in modern times from Western languages such as English. This third category also contains a small number of words that have come from other Asian languages. The frequency of these three types of words varies according to the kinds of written material examined. For example, in magazines, native Japanese words constitute more than half of the total words, while the Chinese borrowed words average about 40%, and the rest are drawn from the recently borrowed words from Western languages. In newspapers, the words of Chinese origin outnumber the Japanese native words.

[0025] Japanese has an open-syllable sound pattern, so that most syllables end in a vowel; the syllable may be composed solely of the vowel. There are five vowels, /a/, /i/, /u/, /e/, and /o/. Vowel length often distinguishes words, as in to for “door” and too for “ten.” The basic consonants are: /k/, /s/, /t/, /n/, /h/, /m/, /y/, /r/, /w/, and the syllabic nasal /N/. Many of these consonants can be palatalized in front of the vowels /a/, /u/, and /o/, for example, /kya/, /kyu/, /kyo/. When the two consonants, /s/ and /t/, occur with the vowel /i/, these consonants are automatically palatalized as /shi/ and /chi/. The consonant /t/ is pronounced as /ts/ in front of the vowel /u/. Unlike English, which has stress accent, Japanese has pitch accent, which means that after an accented syllable, the pitch falls. The word for “chopsticks,” hashi, has the accent on the first syllable, so its pitch contour is ha shi. Without the accent on the first syllable, hashi may mean “bridge” or “edge.” “Bridge” has accent on the second syllable, which can be seen if a grammatical particle such as the subject marker ga is attached to the word: hashi ga. “Edge” has no accent, so it would be pronounced without any fall in the pitch even with a grammatical marker such as ga.

[0026] Every language has a basic word order for the words in a sentence. In English, the sentence “Naomi uses a computer” has the order subject (Naomi), verb (uses), and object (a computer). In the corresponding Japanese sentence, the subject comes first, just as in English, but then the object appears, followed finally by the verb: Naomi-ga (Naomi) konpyuuta-o (computer) tukau (use). The rule of thumb in Japanese is that in a sentence, the verb comes at the end. The two word orders, subject-verb-object for English and subject-object-verb for Japanese, are both common among the languages of the world. If we look again at the Japanese sentence, we see that the subject and the object are accompanied by particles, ga with the subject “Naomi” (Naomi-ga) and o with the object “computer” (konpyuuta-o). These are called case markers, and a large number of the world's languages have them. We can see a remnant of a case-marking system even in English: the pronouns in English change shape depending on where they occur, he/she/they in the subject position, but him/her/them in the object position (e.g., She saw him). Similarly, the older English of five hundred to one thousand years ago had an extensive case-marking system similar to modern Japanese. These case markers make it possible for the words in Japanese to appear in different orders and retain the same meaning. In the exemplary sentence, it is possible to place the object where the subject normally occurs, and the subject in the normal object position, and not change the meaning: konpyuuta-o Naomi-ga tukau. In English, if the same transposition were made, the meaning of the sentence would be radically altered (e.g. The computer uses Naomi). Other variants in the Japanese language make the task of transcribing from English to Japanese or vice-versa arduous at best.

[0027] Japanese is primarily written using two systems of orthography, Chinese characters and syllabaries. Chinese characters, or kanji, were brought in from China starting about 1,500 years ago. Prior to their introduction, Japanese was strictly a spoken language. Chinese characters are by far the more difficult system because of the sheer number of characters and the complexity both in writing and reading each character. Each character is associated with a meaning; for example, the character

has the basic meaning “dog.” There are tens of thousands of characters attested, but in 1946, the Japanese government identified 1,850 characters for daily use. In 1981, the list was increased in number to 1,945 characters, and given the name Joyo Kanji List (Kanji for Daily Use). The characters in the Daily Use List must be learned in primary and secondary schools, and newspapers generally limit the use of characters to this list. Most characters are associated with at least two readings, the native Japanese reading, and the reading that simulates the original Chinese pronunciation of the same character. If the same character came into Japan at different periods or from different dialect regions of China, the character may be associated with several Chinese readings that represent different historical periods and dialectal differences. The second system of writing is syllabaries, or kana, which were developed by the Japanese from certain Chinese characters about 1,000 years ago. Each character in the syllabary represents a syllable in the language, and, unlike Chinese characters, it represents a sound but not meaning. There are two types of syllabaries, hiragana and katakana, each containing the same set of sounds. Hiragana is often used in combination with a Chinese character, in such a way that, for example, the character represents roughly the root of a verb, and the inflection is written with hiragana. Katakana is used to write loan words from Western languages such as English, French, and German. It is not uncommon to find kanji, hiragana, and katakana used in the same sentence. Along with Chinese characters and syllabaries, the Roman alphabet is sometimes employed for such things as names of organizations. Given this complex situation, it is not difficult to imagine that the reliable reading of Japanese scripts can be arduous at best.

[0028] The present invention addresses the challenge of reading prediction by identifying a minimal set of underlying readings for each kanji, defining a set of quasi-phonological rules which operate on the underlying readings in order to produce a surface reading, and constructing a decision tree data structure that is used to determine which underlying reading should be chosen for each kanji in a word. The underlying readings consist of a literal reading and a set of data that controls the operation of the quasi-phonological rules. The decision tree allows the algorithm to choose the most likely reading for a kanji, based only on information obtained during the morphological analysis of the word in which it is found.
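As a rough illustration of how a decision tree could select an underlying reading from morphological features, the following Python sketch walks a small hand-built tree. Everything in it is an assumption made for illustration: the feature names (`category`, `is_word_initial`), the branch values, and the tree shape are hypothetical, and a tree learned by the corpus classifier would look different.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    reading: str  # the underlying reading chosen at this leaf


@dataclass
class Node:
    feature: str                 # hypothetical morphological feature to test
    branches: Dict[str, "Tree"]  # feature value -> subtree
    default: "Tree"              # subtree used when the value is unseen


Tree = Union[Leaf, Node]


def choose_reading(tree: Tree, features: Dict[str, str]) -> str:
    """Follow feature tests from the root until a leaf is reached."""
    while isinstance(tree, Node):
        value = features.get(tree.feature, "")
        tree = tree.branches.get(value, tree.default)
    return tree.reading


# Toy tree for a single kanji; the values are arbitrary placeholders.
TOY_TREE = Node(
    feature="category",
    branches={
        "GOku": Leaf("aba"),
        "Noun": Node(
            feature="is_word_initial",
            branches={"yes": Leaf("hatsu"), "no": Leaf("hotsu")},
            default=Leaf("hatsu"),
        ),
    },
    default=Leaf("hatsu"),
)

print(choose_reading(TOY_TREE, {"category": "Noun", "is_word_initial": "no"}))
```

The `default` branch is a design choice of this sketch so that unseen feature values still yield a reading; the source does not say how such cases are handled.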

[0029] The set of underlying readings and the decision tree are learned automatically from a set of linguistic resources including lexical, morphological, and phonological information. The construction of the optimal set of readings and tree enables reading prediction to be made efficiently.

[0030] As will be described below with respect to FIGS. 1-5, the present invention is directed to a system and methods for effectively and reliably predicting readings for Japanese scripts. In accordance with an illustrative implementation thereof, the present invention comprises a system and method to provide content providers with data in a preferred data type.

[0031] In one embodiment, described more fully hereinafter, the methods and apparatus of the present invention may be implemented as part of a computing environment executing one or more components directed to the reading and analysis of Japanese script. The computing environment may comprise various hardware and software combinations to realize the reading of Japanese scripts.

[0032] Exemplary Computing Environment

[0033] FIG. 1 illustrates an example of a suitable computing system environment 100 in which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.

[0034] The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

[0035] The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.

[0036] With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus (also known as Mezzanine bus).

[0037] Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

[0038] The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.

[0039] The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156, such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.

[0040] The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 190.

[0041] The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

[0042] When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on memory device 181. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

[0043] Predicting Readings of Ideographs

[0044] FIGS. 2 and 2A show the cooperation of various data and processing components of reading prediction system 200 to generate underlying readings and a decision tree for use when predicting readings of Japanese ideographs. In an illustrative implementation, reading prediction system 200 comprises base kanji readings 205, training corpus 210, quasi-phonological rules 215, reading analyzer 220, underlying readings 225, analyzed corpus 230, corpus classifier 235, decision tree 240, input sentences 270, reading predictor 265, morphological analyzer 275, morphological analysis 280, and reading predictions 260. Reading predictions system 200 operates in two phases, a “learning phase” and an “execution/run-time phase.” FIG. 2 shows the cooperation of illustrative components for the “learning phase” of reading predictions system 200. The “learning phase” provides reading prediction system 200 with decision tree 240 and underlying readings 225 that are used (along with other illustrative components as shown in FIG. 2A) during the “execution/run-time” phase to provide reading predictions.

[0045] As shown in FIG. 2, reading analyzer 220 accepts as input base kanji readings 205, training corpus 210, and quasi-phonological rules 215. Using these data, reading analyzer 220 creates analyzed corpus 230 and underlying readings 225. Analyzed corpus 230 acts as input to corpus classifier 235, which in turn generates decision tree 240. Further, as shown, processing is passed from reading analyzer 220 to corpus classifier 235 once underlying readings 225 and analyzed corpus 230 are generated. Using decision tree 240 and underlying readings 225, reading predictions system 200 can provide reading predictions 260 during the “execution/run-time phase.” As shown in FIG. 2A, reading predictor 265 accepts as input input sentences 270, decision tree 240, underlying readings 225, quasi-phonological rules 215, and morphological analysis 280 to produce reading predictions 260. In operation, input sentences are operated on by reading predictor 265 and morphological analyzer 275. Morphological analyzer 275 operates on input sentences 270 to produce morphological analysis 280. Morphological analyzer 275 is better described in U.S. Pat. Nos. 5,963,893 and 5,946,648, assigned to Microsoft Corp., the assignee of the present invention, both of which are herein incorporated by reference in their entirety. In turn, morphological analysis 280 acts as input to reading predictor 265, which uses it to process input sentences 270.

[0046] Specifically, reading prediction system 200 starts with a complete list of the base readings of each kanji. The base readings contain only information about pronunciation and the historical class of the reading. Readings are divided into two classes based on whether the reading was originally borrowed from Chinese (an on reading) or was created expressly for Japanese (a kun reading). This information was originally taken from a machine-readable dictionary of Japanese that Microsoft has purchased; the list was subsequently modified as necessary to improve the performance of the prediction procedure. The base readings are stored in a text file, which is read by the training program.

[0047] In the illustrative example that follows, the readings of kanji and words/morphemes are represented in romaji for the convenience of the reader. However, in the actual data, the readings are always written in hiragana. Accordingly, the examples refer to “the first kana of the reading” and so forth. For example, the base reading character data for

are as follows:

[0048] hatsu, on reading

[0049] hotsu, on reading

[0050] abaki, kun reading

[0051] okoshi, kun reading

[0052] tachi, kun reading

[0053] hasshi, kun reading

[0054] hana, kun reading

[0055] hira, kun reading
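To make the shape of this data concrete, the following Python sketch parses base readings for a single kanji into records holding a pronunciation and a historical class. The line layout ("reading, class reading") mirrors the listing above, but the actual format of the text file is not given in the source, so the parser, the `BaseReading` type, and the function name are assumptions. Romaji is used here only for readability; as noted, the actual data is written in hiragana.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BaseReading:
    pronunciation: str  # romaji here; hiragana in the actual data
    kind: str           # historical class: "on" or "kun"


def parse_base_readings(lines: Iterable[str]) -> List[BaseReading]:
    """Parse lines such as 'hatsu, on reading' (assumed layout)."""
    readings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pronunciation, _, label = line.partition(",")
        words = label.split()
        kind = words[0] if words else ""
        if kind not in ("on", "kun"):
            raise ValueError(f"unknown reading class in line: {line!r}")
        readings.append(BaseReading(pronunciation.strip(), kind))
    return readings


SAMPLE = """\
hatsu, on reading
hotsu, on reading
abaki, kun reading
okoshi, kun reading
tachi, kun reading
hasshi, kun reading
hana, kun reading
hira, kun reading
"""

readings = parse_base_readings(SAMPLE.splitlines())
on_readings = [r.pronunciation for r in readings if r.kind == "on"]
print(on_readings)
```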

[0056] Also present at the beginning is the complete list of quasi-phonological rules. The rules specify that readings undergo certain modifications when they occur in specific environments. These rules encapsulate both purely phonological phenomena such as weak vowel deletion and Lyman's Law as well as purely orthographical phenomena such as the practice of spelling part of the reading in kana (okurigana). Each rule is implemented as an environment to be matched (the “left hand side” of the rule) and an action to be taken (the “right hand side”). A portion of the rules can be paraphrased as follows:

[0057] If a kana is part of a kun reading and it is the first kana in a morpheme, and it follows a syllabic nasal kana, and it begins with an unvoiced consonant, and the remainder of the morpheme does not contain a voiced obstruent, then replace the unvoiced consonant with its voiced counterpart.

[0058] If a reading ends with the underlying? phoneme, delete the phoneme and double the initial consonant of the reading that follows.

[0059] If a reading has more than two kana, remove the last two kana.

[0060] The rules always apply in a fixed order and cannot apply to their own output. Furthermore, some rules, when applied, forbid the application of any further rules.
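The environment/action structure and the fixed ordering described above can be sketched as follows. This is a minimal illustration only, not the actual implementation: the names `Rule`, `apply_rules`, and `forbids_further_rules` are hypothetical, readings are handled as romaji strings rather than hiragana, the NasalStopping rule is reduced to replacing an initial h with p after a reading ending in n, and marking Gemination as a rule that forbids further rules is an assumption made purely to show the mechanism.

```python
from dataclasses import dataclass
from typing import Callable, List

Readings = List[str]  # one reading per kanji, in romaji here (the actual data uses hiragana)


@dataclass
class Rule:
    name: str
    environment: Callable[[Readings], bool]  # "left hand side": is the rule's environment matched?
    action: Callable[[Readings], Readings]   # "right hand side": the modification to perform
    forbids_further_rules: bool = False      # some rules forbid any later rule from applying


def _gemination_sites(readings: Readings) -> List[int]:
    # Positions whose reading ends in the "?" phoneme and is followed by another reading.
    return [i for i in range(len(readings) - 1)
            if readings[i].endswith("?") and readings[i + 1]]


def _geminate(readings: Readings) -> Readings:
    out = list(readings)
    for i in _gemination_sites(readings):
        out[i] = out[i][:-1]                      # delete the "?" phoneme
        out[i + 1] = out[i + 1][0] + out[i + 1]   # double the following initial consonant
    return out


def _nasal_sites(readings: Readings) -> List[int]:
    # Positions of readings that start with "h" directly after a reading ending in "n".
    return [i + 1 for i in range(len(readings) - 1)
            if readings[i].endswith("n") and readings[i + 1].startswith("h")]


def _nasal_stop(readings: Readings) -> Readings:
    out = list(readings)
    for j in _nasal_sites(readings):
        out[j] = "p" + out[j][1:]
    return out


# Hypothetical rule list; the order of this list is the fixed order of application.
RULES = [
    Rule("Gemination", lambda r: bool(_gemination_sites(r)), _geminate,
         forbids_further_rules=True),  # assumption, for illustration only
    Rule("NasalStopping", lambda r: bool(_nasal_sites(r)), _nasal_stop),
]


def apply_rules(readings: Readings, rules: List[Rule]) -> Readings:
    # Each rule is visited once, in order, so no rule is applied to its own output.
    for rule in rules:
        if rule.environment(readings):
            readings = rule.action(readings)
            if rule.forbids_further_rules:
                break
    return readings
```

For instance, `apply_rules(["kan", "hatsu"], RULES)` leaves the first reading alone and changes the second to `patsu`, while a reading ending in `?` is shortened and the consonant after it is doubled.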

[0061] A corpus of training data is assembled which includes all the words in the main lexicon of the Japanese morphological analyzer, all of the morphemes in the finite state grammar of the analyzer, a list of known non-standard spelling variants, and a list of typical numbers and dates. Each entry includes the item's spelling, its morphological category or part of speech, and the item's reading. The corpus is converted into several text files, which are processed by reading analyzer 220 of FIG. 2.

[0062] A portion of exemplary data contained in the corpus is as follows:

[0063] GOku, aba,

[0064] GOsu, oko,

[0065] GOsu, ha?,

[0066] GOtu, ta,

[0067] Geo, hassamu,

[0068] Lnme, hossa,

[0069] Noun, kappatsu,

[0070] Noun, hatsumei,

[0071] Noun, ichinenhokki,

[0072] Noun, kanpatsu,

[0073] Noun, kanpatsu,

[0074] Noun, hokku,

[0075] Noun, hotsui,

[0076] DER_class_shot_hatu, ippatsu,

[0077] DER_class_shot_hatu, nihatsu,

[0078] DER_class_shot_hatu, sanpatsu,

[0079] During the “learning phase” each entry of the training corpus is analyzed to determine, for each kanji in each word, which base reading is used, which phonological rules applied, and which rules could have applied but did not. This step is realized by performing an exhaustive search of possible combinations, and finding those that produce a reading that matches the entry's reading. Illustrative processing is as follows:

[0080] For each entry in the training corpus

[0081] For each kanji in the spelling

[0082] For each of the kanji's base readings

[0083] Substitute the base reading for the kanji to form a reading hypothesis

[0084] For each reading hypothesis

[0085] For each phonological rule with an environment that is matched

[0086] Duplicate the current reading hypothesis

[0087] In one copy, perform the action part of the rule and mark that the rule was applied.

[0088] In the other copy, mark that the rule was blocked.

[0089] If a reading hypothesis matches the reading of the entry, save the hypothesis
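The exhaustive search outlined in the steps above might be sketched as follows. This is an illustrative approximation under several assumptions: the kanji are identified by the placeholder keys `K1` and `K2`, readings are romaji strings, and the three rules are simplified stand-ins (NasalVoicing and NasalStopping change an initial h to b or p after a reading ending in n, and SpellingVariant1 is approximated as dropping a final `tsu`). The function names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Readings = List[str]
# A rule returns the modified readings, or None when its environment is not matched.
RuleFn = Callable[[Readings], Optional[Readings]]

# Placeholder kanji keys mapped to their base readings (romaji for readability).
BASE_READINGS: Dict[str, List[str]] = {
    "K1": ["kan"],
    "K2": ["hatsu", "hotsu", "abaki", "okoshi", "tachi", "hasshi", "hana", "hira"],
}


def _after_nasal(readings: Readings, old: str, new: str) -> Optional[Readings]:
    for i in range(1, len(readings)):
        if readings[i - 1].endswith("n") and readings[i].startswith(old):
            out = list(readings)
            out[i] = new + readings[i][len(old):]
            return out
    return None


def nasal_voicing(readings: Readings) -> Optional[Readings]:
    return _after_nasal(readings, "h", "b")


def nasal_stopping(readings: Readings) -> Optional[Readings]:
    return _after_nasal(readings, "h", "p")


def spelling_variant1(readings: Readings) -> Optional[Readings]:
    # Simplified stand-in: drop a trailing "tsu" from the last reading.
    last = readings[-1]
    if last.endswith("tsu") and len(last) > 3:
        return readings[:-1] + [last[:-3]]
    return None


RULES: List[Tuple[str, RuleFn]] = [
    ("NasalVoicing", nasal_voicing),
    ("NasalStopping", nasal_stopping),
    ("SpellingVariant1", spelling_variant1),
]


def analyze(kanji: Tuple[str, ...], target: str) -> List[dict]:
    """Return every hypothesis whose surface form equals the entry's reading."""
    found: List[dict] = []

    def expand(readings, idx, applied, blocked, base):
        if idx == len(RULES):
            if "".join(readings) == target:
                found.append({"base": base, "applied": applied, "blocked": blocked})
            return
        name, rule = RULES[idx]
        changed = rule(readings)
        if changed is None:
            expand(readings, idx + 1, applied, blocked, base)
        else:
            # Duplicate the hypothesis: one copy applies the rule, the other blocks it.
            expand(changed, idx + 1, applied + [name], blocked, base)
            expand(readings, idx + 1, applied, blocked + [name], base)

    for base in product(*(BASE_READINGS[k] for k in kanji)):
        expand(list(base), 0, [], [], base)
    return found
```

Calling `analyze(("K1", "K2"), "kanpatsu")` explores every base reading pair and every apply/block choice, keeping only the hypotheses that reproduce the entry's reading.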

[0090] It is possible for reading analyzer 220 to produce more than one successful hypothesis, or to produce none at all. In the case of multiple successful hypotheses, the reading prediction system chooses the best hypothesis using heuristics that favor simpler hypotheses. By examining the output of the “learning phase,” the set of base readings and phonological rules can be modified to reduce the number of ambiguities and failures.

[0091] As an example of typical operation, during the “learning phase” the following entry may be analyzed as follows:

[0092] Noun, kanpatsu,

[0093] The character

has just one base reading:

[0094] kan, on

[0095] Combined with the eight base readings for

enumerated above, this produces eight reading hypotheses before phonological rules are applied: kanhatsu, kanhotsu, kanabaki, kanokoshi, kantachi, kanhasshi, kanhana, and kanhira. Reading analyzer 220 (that is, the algorithm executed by the reading analyzer) finds that kanhatsu matches the environment for a rule called NasalVoicing, which voices consonants after a syllabic nasal. Applying this rule would produce kanbatsu, and no subsequent combination of rule applications leads to the correct reading. However, if NasalVoicing is blocked then the hypothesis matches the environment for another rule, NasalStopping. Applying this rule produces kanpatsu. A later rule, SpellingVariant1, would change kanpatsu to kanpa; when this rule is blocked, the final hypothesis remains kanpatsu, which is the correct surface reading.

[0096] The reading hypotheses are converted into underlying readings by a straightforward method. It is assumed that every phonological rule will apply when its environment is matched, unless it is blocked. The underlying reading thus needs only to record which rules are blocked. For the above example, the underlying readings are thus:

[0097]

—kan, on, -NasalVoicing

[0098]

—hatsu, on, -SpellingVariant1

[0099] After analyzing the entire training corpus in this fashion, reading predictions system 200 has identified the complete set of underlying readings 225 for each kanji, and the complete set of words where each reading has appeared. Reading predictions system 200 uses this information to create decision tree 240 for each kanji; decision tree 240 predicts the underlying reading of the kanji in a given context. Decision tree 240 uses only information that will be available from the morphological analysis of a sentence. Stated differently, decision tree 240 can make a prediction about the underlying readings of words regardless of whether the words occurred in the training corpus.

[0100] In an illustrative implementation, decision tree 240 is created using a variant of the well-known ID3 machine learning algorithm. That is, each word is treated as an event, the outcome of which (the correct underlying reading) is known. The algorithm attempts to classify the events into subsets which all have the same outcome. It does so by dividing the set of events into subsets where each member of the subset has the same value of a classification attribute, where the attribute is something known about the event other than the outcome. By calculating the entropy of each set before and after being divided, the algorithm is provided with a metric called entropy gain. The algorithm searches for the sequence of attribute tests that maximizes the entropy gain at each division, and creates a sequence of tests that eventually classifies the events into homogeneous subsets sharing the same outcome.
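The entropy gain metric used to choose each division can be illustrated with a short sketch. The code below shows the standard ID3-style calculation, not the particular variant used by the system; the event data and the function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

# An event pairs the known attribute values of a word with its outcome (the underlying reading).
Event = Tuple[Dict[str, Hashable], str]


def entropy(outcomes: Sequence[str]) -> float:
    """Shannon entropy, in bits, of a collection of outcomes."""
    total = len(outcomes)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(outcomes).values())


def entropy_gain(events: List[Event], attribute: str) -> float:
    """Entropy before dividing on `attribute` minus the weighted entropy after."""
    before = entropy([outcome for _, outcome in events])
    subsets = defaultdict(list)
    for attrs, outcome in events:
        subsets[attrs[attribute]].append(outcome)
    after = sum(len(s) / len(events) * entropy(s) for s in subsets.values())
    return before - after


def best_attribute(events: List[Event], attributes: Sequence[str]) -> str:
    """Pick the attribute test that maximizes the entropy gain for this division."""
    return max(attributes, key=lambda a: entropy_gain(events, a))


# Hypothetical toy events: IsMorphFinal separates the readings perfectly, IsBigram not at all.
EVENTS: List[Event] = [
    ({"IsMorphFinal": True, "IsBigram": True}, "C"),
    ({"IsMorphFinal": True, "IsBigram": True}, "C"),
    ({"IsMorphFinal": False, "IsBigram": True}, "D"),
    ({"IsMorphFinal": False, "IsBigram": True}, "D"),
]
```

In this toy set, dividing on IsMorphFinal yields two homogeneous subsets, so it is the attribute a gain-maximizing search would test first.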

[0101] During the “learning phase” reading predictions system 200 employs a set of classification attributes, which are derived from the information available from morphological analysis. The set includes attributes such as:

[0102] IsBoundMorpheme—true if the morpheme containing the kanji is an affix

[0103] IsStemMorpheme—true if the morpheme containing the kanji is a stem

[0104] IsMorphInitial—true if the kanji is the first character in the morpheme

[0105] IsMorphFinal—true if the kanji is the last character in the morpheme

[0106] PrecedesKanji—true if the kanji immediately precedes another kanji in the morpheme

[0107] FollowsKanji—true if the kanji immediately follows another kanji in the morpheme

[0108] PrecedesHiragana—true if the kanji immediately precedes a hiragana in the morpheme

[0109] FollowsHiragana—true if the kanji immediately follows a hiragana in the morpheme

[0110] PrecedesKatakana—true if the kanji immediately precedes a katakana in the morpheme

[0111] FollowsKatakana—true if the kanji immediately follows a katakana in the morpheme

[0112] AllKanji—true if all the characters in the morpheme containing the kanji are kanji

[0113] IsUnigram—true if the morpheme containing the kanji is only one character long

[0114] IsBigram—true if the morpheme containing the kanji is two characters long

[0115] IsTrigram—true if the morpheme containing the kanji is three characters long

[0116] IsTetragram—true if the morpheme containing the kanji is four characters long

[0117] IsFactoid—true if the morpheme containing the kanji is a name, date, or number

[0118] IsBoundR—true if the morpheme containing the kanji is a one character suffix

[0119] IsBoundL—true if the morpheme containing the kanji is a one character prefix

[0120] MorphIDEquals(X)—true if the morpheme containing the kanji is X

[0121] WordIDEquals(X)—true if the word containing the kanji is X

[0122] NextCharEquals(X)—true if the kanji immediately precedes X in the morpheme

[0123] ThirdCharEquals(X)—true if the kanji precedes X by two characters in the morpheme

[0124] PrevCharEquals(X)—true if the kanji immediately follows X in the morpheme
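Several of the positional and length attributes listed above can be computed directly from the characters of a morpheme. The sketch below is an assumption-laden illustration: the function names are hypothetical, script membership is approximated with the basic Unicode blocks for CJK unified ideographs, hiragana, and katakana, and attributes that need the morphological analysis itself (such as IsBoundMorpheme or IsFactoid) are omitted. The sample strings in the usage note are chosen for illustration and are not taken from the training corpus.

```python
from typing import Dict


def is_kanji(ch: str) -> bool:
    # Basic CJK Unified Ideographs block only; a fuller check would cover the extensions.
    return "\u4e00" <= ch <= "\u9fff"


def is_hiragana(ch: str) -> bool:
    return "\u3040" <= ch <= "\u309f"


def is_katakana(ch: str) -> bool:
    return "\u30a0" <= ch <= "\u30ff"


def classification_attributes(morpheme: str, index: int) -> Dict[str, bool]:
    """Compute surface-level attributes for the kanji at `index` within `morpheme`."""
    prev_ch = morpheme[index - 1] if index > 0 else ""
    next_ch = morpheme[index + 1] if index + 1 < len(morpheme) else ""
    length = len(morpheme)
    return {
        "IsMorphInitial": index == 0,
        "IsMorphFinal": index == length - 1,
        "PrecedesKanji": is_kanji(next_ch),
        "FollowsKanji": is_kanji(prev_ch),
        "PrecedesHiragana": is_hiragana(next_ch),
        "FollowsHiragana": is_hiragana(prev_ch),
        "PrecedesKatakana": is_katakana(next_ch),
        "FollowsKatakana": is_katakana(prev_ch),
        "AllKanji": all(is_kanji(ch) for ch in morpheme),
        "IsUnigram": length == 1,
        "IsBigram": length == 2,
        "IsTrigram": length == 3,
        "IsTetragram": length == 4,
    }
```

For example, `classification_attributes("発明", 0)` reports IsMorphInitial, PrecedesKanji, AllKanji, and IsBigram as true, whereas `classification_attributes("発く", 0)` reports PrecedesHiragana as true and AllKanji as false.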

[0125] Using the classification attributes, reading predictions system 200 would operate on the following examples as follows. Suppose that the only instances of

in the training corpus were:

[0126] 1. GOku, aba,

[0127] 2. GOsu, oko,

[0128] 3. Noun, kappatsu,

[0129] 4. NCna, hatsumei,

[0130] 5. Noun, ichinenhokki,

[0131] 6. Noun, kanpatsu,

[0132] 7. Noun, hokku,

[0133] 8. Noun, hotsui,

[0134] The underlying readings of

identified by the analysis phase would be:

[0135] 1. A: aba, kun, -SpellingVariant1

[0136] 2. B: oko, kun, -SpellingVariant1

[0137] 3. C: hatsu, on, -SpellingVariant1

[0138] 4. C: hatsu, on, -SpellingVariant1

[0139] 5. D: hotsu, on

[0140] 6. C: hatsu, on, -SpellingVariant1

[0141] 7. D: hotsu, on

[0142] 8. E: hotsu, on, -SpellingVariant1

[0143] The reading analyzer algorithm would create a decision tree like:

If_IsMorphID(GOku) Reading A

Else If_IsMorphID(GOsu) Reading B

Else If_IsFinal Reading C

Else If_IsTetragram Reading D

Else If_IsMorphID(NCna) Reading C

Else If_NextCharEquals(

) Reading D

Else Reading E

[0144] In some cases the classification attributes cannot completely separate the words into homogeneous classes. When this situation arises, the algorithm performs the final separation probabilistically, based on the frequencies of the examples, which are calculated from the frequencies of the words in the training corpus. If the example data above also included the item:

[0145] 9. Noun, hatsui,

(reading C)

[0146] and both items 8 and 9 had the same frequency, the final piece of the above tree would be replaced by:

[0147] If_NextCharEquals(

)

[0148] Reading D

[0149] Else

[0150] Probabilistic

[0151] 0.5 Reading E

[0152] 0.5 Reading C
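The example tree, including its probabilistic final node, can be walked with very little code. The sketch below is illustrative only: it assumes the attribute values have already been computed into a dictionary keyed by the test names, it uses the placeholder `X` for the character tested by NextCharEquals, and the names `TESTS`, `FALLBACK`, and `predict` are hypothetical.

```python
from typing import Dict, List, Tuple

Prediction = Tuple[str, float]  # (reading label, confidence)

# The chain of tests from the example tree, in order. "X" is a placeholder for the
# character tested by NextCharEquals.
TESTS: List[Tuple[str, str]] = [
    ("IsMorphID(GOku)", "A"),
    ("IsMorphID(GOsu)", "B"),
    ("IsFinal", "C"),
    ("IsTetragram", "D"),
    ("IsMorphID(NCna)", "C"),
    ("NextCharEquals(X)", "D"),
]

# Probabilistic node reached when no test succeeds: (probability, reading label).
FALLBACK: List[Tuple[float, str]] = [(0.5, "E"), (0.5, "C")]


def predict(attributes: Dict[str, bool]) -> List[Prediction]:
    """Walk the tests in order; return candidate readings by decreasing confidence."""
    for test, reading in TESTS:
        if attributes.get(test, False):
            return [(reading, 1.0)]
    # sorted() is stable, so equally likely readings keep their listed order.
    ranked = sorted(FALLBACK, key=lambda pr: -pr[0])
    return [(reading, probability) for probability, reading in ranked]
```

A word whose kanji is morpheme-final resolves to Reading C with full confidence, while a word that fails every test receives both Reading E and Reading C at a confidence of 0.5 each.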

[0153] In order to maximize speed in the “execution/run-time” phase, most of the work is done during the “learning” phase. During the “execution/run-time” phase, the reading prediction algorithm is implemented as a module within an exemplary computing application (as shown in FIG. 5), which also contains the Japanese morphology analyzer. To predict the reading for a given kanji, the morphology engine is used to analyze the sentence that contains the word that contains the kanji. The values of the classification attributes are calculated from the analysis and then used to walk through the decision tree to find the underlying reading for the kanji.

[0154] Then the phonological rules are applied to the underlying readings (unless they are blocked by the underlying reading) to produce the surface form of the reading. A confidence level is also calculated for the surface reading; if the traversal of the decision tree encountered a probabilistic node, the confidence level will reflect the probability of the paths followed. If the reading prediction module is called repeatedly for the same input words, it will return all the different possible predictions in order of decreasing confidence.

[0155] FIG. 3 shows the general steps that are performed by reading predictions system 200 to analyze and provide reading predictions for an exemplary sentence. As shown, to determine the reading of the word

(305) in the sentence:

[0156]

. (300)

[0157] The sentence is first analyzed by morphological analyzer 275 of FIG. 2A, revealing the structure:

[0158]

(Pronoun)

(Particle)

(Noun Complement)

(Copula).(310)

[0159] Then the classification attributes for the two kanji

and

are calculated. The decision trees for each of the two kanji are then walked through, according to the values of the attributes. The underlying readings (315):

[0160] hatsu, on, -SpellingVariant1

[0161] mei, on, -SpellingVariant1

[0162] are selected, and a representation of the word reading hatsumei is created. Then the phonological rules are applied to the word reading, and since the only rule with an environment that matches is SpellingVariant1, and that rule is blocked from applying to both readings, the final surface reading prediction is hatsumei.
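The final run-time step, producing a surface reading from underlying readings that carry blocked-rule marks, might look like the sketch below. It rests on several assumptions: readings are romaji strings, the three rules are the same simplified stand-ins used for illustration elsewhere (SpellingVariant1 approximated as dropping a final `tsu`), and a rule is skipped for the whole word if any underlying reading in the word blocks it. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple

Readings = List[str]
RuleFn = Callable[[Readings], Optional[Readings]]  # None means "environment not matched"


@dataclass(frozen=True)
class UnderlyingReading:
    reading: str                          # romaji here; hiragana in the actual data
    reading_class: str                    # "on" or "kun"
    blocked: FrozenSet[str] = frozenset()  # names of rules that must not apply


def _after_nasal(readings: Readings, old: str, new: str) -> Optional[Readings]:
    for i in range(1, len(readings)):
        if readings[i - 1].endswith("n") and readings[i].startswith(old):
            out = list(readings)
            out[i] = new + readings[i][len(old):]
            return out
    return None


def _spelling_variant1(readings: Readings) -> Optional[Readings]:
    # Simplified stand-in: drop a trailing "tsu" from the last reading.
    last = readings[-1]
    if last.endswith("tsu") and len(last) > 3:
        return readings[:-1] + [last[:-3]]
    return None


RULES: List[Tuple[str, RuleFn]] = [
    ("NasalVoicing", lambda r: _after_nasal(r, "h", "b")),
    ("NasalStopping", lambda r: _after_nasal(r, "h", "p")),
    ("SpellingVariant1", _spelling_variant1),
]


def surface_reading(word: List[UnderlyingReading]) -> str:
    """Apply every rule, in fixed order, whose environment matches and which is not blocked."""
    # Assumption: block marks are pooled over the whole word.
    blocked = frozenset().union(*(u.blocked for u in word))
    readings = [u.reading for u in word]
    for name, rule in RULES:
        if name in blocked:
            continue
        changed = rule(readings)
        if changed is not None:
            readings = changed
    return "".join(readings)
```

With `hatsu` and `mei` both marked `-SpellingVariant1`, no rule applies and the result is `hatsumei`; with `kan` marked `-NasalVoicing` and `hatsu` marked `-SpellingVariant1`, only NasalStopping applies and the result is `kanpatsu`.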

[0163] FIG. 4 shows in more detail the processing performed by reading predictions system 200 when operating in the “learning phase”. Processing begins at block 400 and proceeds to block 405 where Japanese reading data is loaded onto reading predictions system 200. In an illustrative implementation, Japanese reading data comprises a set of standard kanji readings, including their classification as on or kun readings. From there quasi-phonological rules are loaded onto reading predictions system 200 at block 410. Then, at block 415, the corpus of Japanese data is loaded onto reading predictions system 200. The corpus of Japanese data comprises entries from a Japanese dictionary, morphemes from a Japanese finite-state grammar, and a set of Japanese phrases such as numbers and dates. Each item includes a spelling, a reading, and a part of speech or morphological category. A base reading is then assigned to each entry of the Japanese data corpus at block 420. Processing then proceeds to block 425 where a reading hypothesis is developed for each entry of the Japanese data corpus. The developed hypotheses of block 425 are then converted to underlying readings at block 430. Using the underlying readings, reading predictions system 200 creates, at block 435, a decision tree that is used in the “execution/run-time phase” of reading predictions system 200. The decision tree having been generated, processing terminates at block 440.

[0164] FIG. 4A shows the processing performed by the reading prediction system when operating in the “execution/run-time” mode/phase. As shown, processing begins at block 445 and proceeds to block 450 where an inputted sentence is analyzed using a morphology analyzer. From there processing proceeds to block 455 where the classification attributes of the Japanese ideographs present in the inputted sentence are calculated. Using the classification attributes, the decision tree (generated in block 435 of FIG. 4) is “walked” to determine the underlying reading of the Japanese ideographs (kanji), as well as a confidence level for the prediction. A surface form reading is then produced at block 465 by applying phonological rules to the created underlying reading. The surface forms are returned in order of decreasing confidence at block 470. Processing then terminates at block 475.

[0165] FIG. 5 shows a screen shot of an exemplary computing application having incorporated therein features of the present invention. Exemplary computing application 500 comprises a display/interface pane having display/interface controls 510 and display/interface area 515. As shown, Japanese ideographs (i.e. kanji script) 520 can be displayed in display/interface area 515. In operation, exemplary computing application 500 may employ features of the present invention to perform a style check on inputted Japanese ideographs (e.g. 520) to ensure proper usage of the inputted Japanese ideographs in proffered Japanese sentences. Such operation may be realized through the use of a “Style Checker” feature in the exemplary computing application. The “Style Checker” may be incorporated as one of display/interface controls 510 such that when Japanese sentences (i.e. Japanese sentences having words comprised of Japanese ideographs) are inputted for display on display/interface area 515, the “Style Checker,” having incorporated the reading predictions system (of FIGS. 2 and 2A), can process the inputted Japanese sentences and confirm consistent usage of inputted Japanese ideographs.

[0166] In sum, the present invention provides a system and methods allowing for effective and reliable reading predictions for Japanese ideographs. It is understood, however, that the invention is susceptible to various modifications and alternative constructions. There is no intention to limit the invention to the specific constructions described herein. On the contrary, the invention is intended to cover all modifications, alternative constructions, and equivalents falling within the scope and spirit of the invention.

[0167] It should also be noted that the present invention may be implemented in a variety of computer systems. The various techniques described herein may be implemented in hardware or software, or a combination of both. Preferably, the techniques are implemented in computer programs executing on programmable computers that each include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code is applied to data entered using the input device to perform the functions described above and to generate output information. The output information is applied to one or more output devices. Each program is preferably implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Each such computer program is preferably stored on a storage medium or device (e.g., ROM or magnetic disk) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described above. The system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner. Further, the storage elements of the exemplary computing applications may be relational or sequential (flat file) type computing databases that are capable of storing data in various combinations and configurations.

[0168] Although exemplary embodiments of the invention have been described in detail above, those skilled in the art will readily appreciate that many additional modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of the invention. Accordingly, these and all such modifications are intended to be included within the scope of this invention construed in breadth and scope in accordance with the appended claims.

What is claimed is:
 1. A method to predict the reading of Japanese ideographs of Japanese words and/or sentences comprising the steps of: creating underlying readings for a data store having Japanese words with Japanese ideographs, said underlying readings created employing data comprising any of base kanji readings and quasi-phonological rules; generating a decision tree, said decision tree setting forth steps for predicting readings of said Japanese ideographs; and processing said Japanese words and/or sentences to provide readings of said Japanese ideographs of said Japanese words and/or sentences.
 2. The method as recited in claim 1, wherein said creating step further comprises the step of providing a reading analyzer, said reading analyzer accepting as input said base kanji readings, said quasi-phonological rules, and a training corpus for processing to create said underlying readings, wherein said training corpus comprises said data store having Japanese words with Japanese ideographs.
 3. The method as recited in claim 1, wherein said generating step further comprises the step of providing a learning algorithm, said learning algorithm setting forth steps to creating said decision tree.
 4. The method as recited in claim 3, wherein said providing step comprises the step of furnishing an ID3-type machine learning algorithm.
 5. The method as recited in claim 4, further comprising the steps of: treating each Japanese ideograph in each Japanese word of said data store having Japanese words with Japanese ideographs as an event, wherein the outcome of each event is the correct underlying reading of said each Japanese ideograph in said Japanese word; classifying said events into sets having the same outcome, wherein said classifying step further comprises the steps of dividing said sets into subsets where each member of said subsets has the same value of a classification attribute, wherein said classification attribute is a known fact about the event other than the outcome; calculating the entropy of each set before and after being divided to produce an entropy gain; and searching for the sequence of attribute tests that maximizes the entropy gain at each division to create a sequence of tests that classifies the events into homogenous subsets sharing the same outcome.
 6. The method as recited in claim 1, wherein said processing step further comprises the step of: accepting as input various data sources comprising any of said decision tree, said underlying readings, said quasi-phonological rules, and morphological analysis by a reading predictor, said reading predictor using said data sources to parse Japanese words and/or sentences to identify Japanese ideographs and their respective readings, wherein said morphological analysis is produced by a morphological analyzer using linguistic morphology rules.
 7. The method as recited in claim 6, further comprising the steps of: analyzing Japanese words and/or sentences by morphological analyzer to determine their structure, wherein said structure comprising Japanese ideographs; calculating the classification attributes for said Japanese ideographs; walking said decision tree according to the value of said calculated attributes; selecting the appropriate underlying readings for said Japanese ideographs; and applying said quasi-phonological rules to said underlying readings to produce surface readings.
 8. A computer readable storage medium comprising computer-executable instructions for instructing a computer to perform the acts recited in claim 1.
 9. A system to predict readings of Japanese ideographs comprising: a Japanese reading analyzer, said reading analyzer accepting Japanese language data as input to produce underlying readings for Japanese ideographs in said corpus of Japanese words, and a decision tree used in predicting the reading of Japanese ideographs; and a Japanese reading predictor, said reading predictor accepting said produced decision tree, said Japanese language data, and a morphological analysis as input to operate on Japanese words and/or sentences to provide reading predictions for Japanese ideographs present in said inputted Japanese words and/or sentences.
 10. The system as recited in claim 9, wherein said Japanese language data comprises any of basic kanji readings, a corpus of Japanese words and morphemes, and quasi-phonological rules.
 11. The system as recited in claim 9, wherein said morphological analysis is created by a morphological analyzer, said morphological analyzer having the ability to process Japanese words and/or sentences according to pre-defined Japanese language morphology rules.
 12. The system as recited in claim 10, wherein said morphological analyzer accepts as input Japanese words and/or sentences to calculate classification attributes for Japanese ideographs present in said inputted Japanese words and/or sentences, wherein said classification attributes assist said reading predictor to create a surface reading for Japanese ideographs in said inputted Japanese words and/or sentences.
 13. The system as recited in claim 12, wherein said classification attributes comprise any of: IsBoundMorpheme, IsStemMorpheme, IsMorphInitial, IsMorphFinal, PrecedesKanji, FollowsKanji, PrecedesHiragana, FollowsHiragana, PrecedesKatakana, FollowsKatakana, AllKanji, IsUnigram, IsBigram, IsTrigram, IsTetragram, IsFactoid, IsBoundR, IsBoundL, MorphIDEquals(X), WordIDEquals(X), NextCharEquals(X), ThirdCharEquals(X), and PrevCharEquals(X).
 14. The system as recited in claim 13, wherein said classification attributes are rooted in Japanese linguistic rules.
 15. The system as recited in claim 9, wherein said reading analyzer comprises a learning algorithm, said learning algorithm providing steps to facilitate the creation of said decision tree.
 16. The system as recited in claim 15, wherein said learning algorithm is an ID3-type machine learning algorithm.
 17. The system as recited in claim 9, wherein said system is incorporated as part of a computing application, said computing application providing features that allow for the reading of Japanese ideographs for style checking.
 18. A method to allow for effective and reliable reading predictions of Japanese ideographs performing the acts of: providing a reading analyzer, said reading analyzer accepting as input various Japanese language data; operating said reading analyzer in a learning mode, wherein said reading analyzer operates on said inputted data to produce underlying readings for said Japanese language data and to generate a decision tree for use when predicting readings of Japanese ideographs; providing a reading predictor, said reading predictor employing said produced underlying readings and said generated decision tree to determine characteristics for Japanese ideographs in inputted Japanese words and/or sentences, wherein said characteristics contribute to the prediction of readings for said Japanese ideographs.
 19. The method as recited in claim 18, wherein said providing said reading analyzer act further comprises the act of providing Japanese language data comprising any of base kanji readings, Japanese lexicon, and quasi-morphological rules.
 20. The method as recited in claim 18, wherein said providing said reading predictor act further comprises the act of furnishing a morphological analysis for said inputted Japanese words and/or sentences, said morphological analysis generated by a morphological analyzer operating on said inputted Japanese words and/or sentences using Japanese linguistic morphology rules.
 21. A computer readable storage medium comprising computer-executable instructions for instructing a computer to perform the acts recited in claim 18.
 22. In a computer system having storage, a method of representing analysis of an input string of natural language characters useful to identify readings of said characters comprising the portions of the input string, comprising the computer-implemented steps of: processing the input string to identify the natural language characters in the string and morphemes in the string; and creating a structure in storage that holds characteristics of said natural language characters, such that the structure may be used to identify the readings of said natural language characters that comprise said input string, said characteristics representative of a decision tree comprising connected nodes including root and leaves, wherein each path of the decision tree from the root to a leaf represents an alternative reading analysis for said natural characters.
 23. The method as recited in claim 22, wherein the input string comprises Japanese characters having Japanese ideographs.
 24. The method as recited in claim 22, wherein the step of processing said input string comprises processing the input string using linguistic morphology rules.
 25. The method as recited in claim 24 further comprising the step of processing said input string by a morphological analyzer.
 26. The method as recited in claim 22, wherein the step of creating said structure comprises employing a learning algorithm to generate said decision tree.